Environmental Agroregions

The decision to cultivate agricultural crops is closely tied to the precipitation regime, temperature regime, and soil quality of a region. However, these variables typically exhibit different spatial patterns, creating distinct zones or sub-regions with unique properties and yield potential.

In this exercise we will use three layers of information to divide the state of Kansas into nine agroregions. The number of nine regions was chosen arbitrarily, but it matches the current number of ecoregions for the state.

The selected variables are:

  • 30-year annual cumulative thermal units: Growing degree days or thermal units are highly correlated with crop growth and development. To avoid accounting for winter and summer crops separately, we will use the annual sum of mean air temperature above zero degrees Celsius as an indicator of the thermal regime across the state. Using cumulative thermal units will also enhance regional differences compared with other variables such as annual mean temperature, which may show only small differences across regions. This variable exhibits a North-South gradient, with the southern part of the state exposed to higher temperatures, and therefore a greater number of cumulative thermal units. Source layer: Parameter-elevation Regressions on Independent Slopes Model (PRISM)

  • 30-year annual precipitation: Precipitation is arguably one of the most important factors conditioning vegetation growth. The state of Kansas exhibits a strong East-West gradient, with eastern portions of the state receiving about three times more precipitation per year (1200 mm) than regions in western Kansas (400 mm). Source layer: Parameter-elevation Regressions on Independent Slopes Model (PRISM)

  • 1-km sand content at 5 cm depth: Soil fertility and soil water storage are major drivers of vegetation. Agricultural crops are often planted in rich, fertile soils. The percentage of sand content in the top 5 cm is a simple proxy variable to indicate contrasting soil types. This is also the most commonly sampled layer, which means that this gridded surface layer is probably more accurate than gridded products for deeper soil layers. Generally speaking, sandy soils have low fertility and cannot hold water in the pore space for very long, so unless there is supplemental irrigation, these soils cannot sustain high-yield crops. In the state of Kansas, sandy soils are typically located on top of shallow (e.g. the Equus Beds) or deep aquifers (e.g. the Ogallala Aquifer). Source: SoilGrids, International Soil Reference and Information Centre.

Both the temperature and precipitation regime are coarse regulators of potentially viable regions for agricultural crops, while soil type is a fine regulator.
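The cumulative thermal units described above can be sketched with a short NumPy example. The daily temperatures below are synthetic (a sinusoidal seasonal cycle loosely resembling Kansas), so the resulting total is only illustrative, not a PRISM value:

```python
import numpy as np

# Synthetic daily mean air temperature (deg C) for one year:
# a sinusoidal seasonal cycle, coldest in mid-January, warmest in mid-July
days = np.arange(365)
tmean = 12 + 15 * np.sin(2 * np.pi * (days - 105) / 365)

# Annual cumulative thermal units: sum of daily mean temperatures
# above a base of 0 deg C (days below freezing contribute nothing)
thermal_units = np.sum(np.maximum(tmean, 0))
print(round(thermal_units), 'C-days per year')
```

Summing only the positive daily means is what distinguishes cumulative thermal units from a plain annual mean, which would average away the sub-zero days.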

This exercise will require the following modules, which are typically not included in the Anaconda distribution:

  • Rasterio: pip install rasterio

  • Xarray: conda install xarray

Note that xr.open_rasterio has been deprecated in recent versions of xarray; if it is unavailable in your installation, the rioxarray package provides an equivalent open_rasterio function.

from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
import numpy as np
import holoviews as hv
hv.extension('bokeh')
import xarray as xr
# Read geotiff layers
gdd = xr.open_rasterio('../datasets/clustering/gdd_30_years.tif')
precip = xr.open_rasterio('../datasets/clustering/precip_30_years.tif')
sand = xr.open_rasterio('../datasets/clustering/sand_0_5_cm.tif')
# Plot each layer to visualize spatial gradients
gdd_map = hv.Image( (gdd.x, gdd.y, gdd.values[0]) ).opts(title='Annual Thermal Units',
                                                         frame_width=200,
                                                         frame_height=200,
                                                         cmap='RdBu_r',
                                                         aspect='equal',
                                                         colorbar=True)

precip_map = hv.Image( (precip.x, precip.y, precip.values[0]) ).opts(title='Annual Precipitation (mm)',
                                                                     frame_width=200,
                                                                     frame_height=200,
                                                                     cmap='YlGnBu',
                                                                     aspect='equal',
                                                                     colorbar=True)

sand_map = hv.Image( (sand.x, sand.y, sand.values[0]) ).opts(title='Percent Sand Content in 0-5 cm',
                                                             frame_width=200,
                                                             frame_height=200,
                                                             cmap='PuOr_r',
                                                             aspect='equal',
                                                             colorbar=True)

(gdd_map + precip_map + sand_map).opts(shared_axes=False).cols(3)

Preparing data for clustering

The clustering routine in Scikit-learn cannot handle missing values (e.g. NaN). Because our maps contain NaN values to indicate pixels outside the state or over water bodies (see sand map), we need to assign a placeholder to these pixels. Rather than assigning zero, we will assign a negative value, which will place these pixels in their own cluster that we can ignore after the clustering.
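A small, self-contained illustration of both points (the arrays here are toy values, not the raster layers): KMeans raises an error on NaN input, while a far-away placeholder simply collects those cells into their own cluster.

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.array([1.0, 2.0, np.nan, 8.0, 9.0])

# KMeans refuses arrays containing NaN
try:
    KMeans(n_clusters=2, n_init=10).fit(x.reshape(-1, 1))
except ValueError:
    print('KMeans rejected NaN input')

# Replacing NaN with a far-away placeholder puts those cells in their own cluster
x_filled = np.where(np.isnan(x), -99.0, x)
labels = KMeans(n_clusters=3, n_init=10,
                random_state=0).fit_predict(x_filled.reshape(-1, 1))
print(labels)
```

The placeholder cell ends up alone in its cluster, which is exactly why we can discard that label after clustering the real maps.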

The other requirement is normalization of the variables. Because different variables span different ranges, their relative magnitudes could bias the clustering result.
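To see why scaling matters, consider two toy variables on very different scales (the values below are hypothetical stand-ins for thermal units and sand content). Without scaling, Euclidean distances in k-means are dominated by the variable with the larger range; unit-norm scaling each variable, as done in this exercise, puts them on a comparable footing:

```python
import numpy as np
from sklearn.preprocessing import normalize

# Toy variables: thermal units in the thousands, sand content in percent
gdd_toy = np.array([4300.0, 4500.0, 4900.0])
sand_toy = np.array([10.0, 35.0, 60.0])

# Raw ranges differ by an order of magnitude
print(np.ptp(gdd_toy), np.ptp(sand_toy))

# Scaling each variable (row) to unit L2 norm brings both onto [0, 1]
scaled = normalize([gdd_toy, sand_toy], axis=1)
print(np.ptp(scaled[0]), np.ptp(scaled[1]))
```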

# Convert 2D arrays into 1D arrays
X_gdd = gdd.values[0].flatten()
X_precip = precip.values[0].flatten()
X_sand = sand.values[0].flatten()

# Find grid cells with NaN values in any layer
idx_nan = np.isnan(X_gdd) | np.isnan(X_precip) | np.isnan(X_sand)

# Replace the NaN values with -99
X_gdd[idx_nan] = -99
X_precip[idx_nan] = -99
X_sand[idx_nan] = -99

X_normalized = normalize([X_gdd,X_precip,X_sand], axis=1)

# Alternative 2
#X_gdd = normalize([X_gdd])
#X_precip = normalize([X_precip])
#X_sand = normalize([X_sand])

# Alternative 3
#X_gdd = (X_gdd - np.nanmean(X_gdd))/np.nanstd(X_gdd)
#X_precip = (X_precip - np.nanmean(X_precip))/np.nanstd(X_precip)
#X_sand = (X_sand - np.nanmean(X_sand))/np.nanstd(X_sand)
# Check that the X array with the variables has the right dimensions
print(X_normalized.shape)
print(np.transpose(X_normalized).shape)
(3, 324714)
(324714, 3)
# Cluster the data
k = 5 + 1 # Add an extra cluster to filter out the grid cells out of the state or on water bodies
clusters = KMeans(n_clusters=k).fit(np.transpose(X_normalized))
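The transpose is needed because Scikit-learn expects samples as rows, i.e. an array of shape (n_samples, n_features), whereas our stacked layers have shape (n_features, n_pixels). A minimal sketch with random stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((3, 100))   # 3 variables, 100 "pixels" (features-by-samples)

# Transpose to (n_samples, n_features) before fitting
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X.T)
print(model.labels_.shape)   # one label per pixel
```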
# Check type of a single value of the resulting cluster labels
print(type(clusters.labels_[0]))
<class 'numpy.int32'>

# Convert to float so that we can bring back the NaN values and mask the resulting image
cluster_labels = clusters.labels_.astype(np.float64)
cluster_labels[idx_nan] = np.nan
print(cluster_labels.shape)
(324714,)
# Reshape the cluster labels into a 2D array matching the shape of the original maps
cluster_labels = np.reshape(cluster_labels, gdd.values[0].shape)
print(cluster_labels.shape) # Check that we changed the shape of the array
(362, 897)
# Plot map (use the x and y coordinates from another layer since the cluster labels don't have coordinates)
hv.Image( (sand.x, sand.y, cluster_labels) ).opts(colorbar=True,
                                                  cmap='Spectral_r',
                                                  aspect='equal',
                                                  xlabel='Longitude',
                                                  ylabel='Latitude',
                                                  frame_width=400)

Create summary of properties for each region

Now that we have divided the state into regions of more homogeneous conditions, let’s generate a summary table with the median values of each property for each zone.

# Find a way to get unique labels in the map
labels = np.unique(cluster_labels[~np.isnan(cluster_labels)])
print(labels)
[0. 2. 3. 4. 5.]

Note

A missing label is likely due to the placeholder for NaN values. This label was removed when we brought the NaN values back to the image.
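If non-consecutive labels are inconvenient (e.g. for map legends), the surviving labels can be remapped to consecutive integers. The label array below is a small hypothetical example, not the actual map:

```python
import numpy as np

# Hypothetical labels where the placeholder cluster (1) was masked to NaN
labels_demo = np.array([0., 2., 3., np.nan, 4., 2.])

# Remap surviving labels to consecutive integers 0..n-1
unique = np.unique(labels_demo[~np.isnan(labels_demo)])
remapped = labels_demo.copy()
for new, old in enumerate(unique):
    remapped[labels_demo == old] = new
print(remapped)   # NaN cells are left untouched
```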

# Iterate over each label and print the median values for each layer
for label in labels:
    idx = cluster_labels == label
    print(label,'=>',
          'GDD :',round(np.median(gdd.values[0][idx])),'C per year', 
          ' PRECIPITATION :',round(np.median(precip.values[0][idx])),'mm', 
          ' SAND :',round(np.median(sand.values[0][idx])),'%')
0.0 => GDD : 4979 C per year  PRECIPITATION : 845 mm  SAND : 29 %
2.0 => GDD : 4444 C per year  PRECIPITATION : 495 mm  SAND : 34 %
3.0 => GDD : 4314 C per year  PRECIPITATION : 576 mm  SAND : 23 %
4.0 => GDD : 4689 C per year  PRECIPITATION : 952 mm  SAND : 15 %
5.0 => GDD : 4879 C per year  PRECIPITATION : 679 mm  SAND : 50 %
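The printed summary above can also be collected into a tidy table. A sketch using pandas (assuming it is installed) with synthetic stand-ins for the flattened layers, since the actual rasters are not reproduced here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-ins for the flattened cluster labels and layers
labels_flat = rng.integers(0, 3, 500).astype(float)
gdd_flat = rng.normal(4600, 200, 500)
precip_flat = rng.normal(700, 150, 500)
sand_flat = rng.normal(30, 10, 500)

# Median of each property per region, as one DataFrame
summary = (pd.DataFrame({'region': labels_flat,
                         'gdd': gdd_flat,
                         'precip_mm': precip_flat,
                         'sand_pct': sand_flat})
             .groupby('region')
             .median()
             .round())
print(summary)
```

With the real data, replacing the synthetic arrays with the flattened layers and cluster_labels (after dropping NaN cells) yields the same table as the loop above.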

Practice

  • Partition the state of Kansas using a different number of regions. What type of regions can be detected by using a higher number of clusters? What happens if you only divide the state into 3 regions?

  • Which of the three variables would dominate the clustering process if the input variables are not normalized?

  • Can you run a similar analysis for your own state?